Using Machine Learning to Predict Unplanned Hospital Utilization and Chemotherapy Management From Patient-Reported Outcome Measures

PURPOSE Adverse effects of chemotherapy often require hospital admissions or treatment management. Identifying factors contributing to unplanned hospital utilization may improve health care quality and patients' well-being. This study aimed to assess if patient-reported outcome measures (PROMs) improve performance of machine learning (ML) models predicting hospital admissions, triage events (contacting helpline or attending hospital), and changes to chemotherapy.
MATERIALS AND METHODS Clinical trial data containing responses to three PROMs (European Organisation for Research and Treatment of Cancer Core Quality of Life Questionnaire [QLQ-C30], EuroQol Five-Dimensional Visual Analogue Scale [EQ-5D], and Functional Assessment of Cancer Therapy-General [FACT-G]) and clinical information on 508 participants undergoing chemotherapy were used. Six feature sets (with the following variables: [1] all available; [2] clinical; [3] PROMs; [4] clinical and QLQ-C30; [5] clinical and EQ-5D; [6] clinical and FACT-G) were applied in six ML models (logistic regression [LR], decision tree, adaptive boosting, random forest [RF], support vector machines [SVMs], and neural network) to predict admissions, triage events, and chemotherapy changes.
RESULTS The comprehensive analysis of the predictive performance of the six ML models for each feature set under three different methods for handling class imbalance indicated that PROMs improved predictions of all outcomes. RF and SVMs had the highest performance for predicting admissions and changes to chemotherapy in balanced data sets, and LR in the imbalanced data set. Balancing data led to the best performance compared with the imbalanced data set or the data set with a balanced training set only.
CONCLUSION These results endorsed the view that ML can be applied to PROM data to predict hospital utilization and chemotherapy management. If further explored, this study may contribute to health care planning and treatment personalization.
Rigorous comparison of model performance affected by different imbalanced data handling methods shows best practice in ML research.


INTRODUCTION
Cancer treatment side effects frequently negatively affect patients' health and often cause emergency hospitalization. 1,2 Unplanned health care utilization can be detrimental for patients' physical and emotional well-being and can reduce health care quality through burdening health care systems. 3 Early identification of factors contributing to acute hospital presentations can support planning for emergency admissions, increase the quality of care, and reduce health care costs. 2,4 Predicting the risk of chemotherapy-related hospital utilization could also help personalize cancer treatment decisions. 5,6
Machine learning (ML) adoption in medicine can aid clinical decisions, improving health care quality. 7 ML methods have been applied to predict health outcomes, including postsurgery complications, 8 stroke rehabilitation success, 9 epilepsy, 10 or mortality. 11 ML models can also be successful in predicting hospital utilization. For instance, binary classifiers were used to robustly predict hospital admissions on the basis of emergency department triage information and patients' medical history. 12 Furthermore, ML algorithms were applied to electronic health records (EHR) to predict chemotherapy-related hospital admissions. 5 However, these models did not include any information gathered from patients about their own health and well-being. Therefore, current AI models process the clinical information well, but without consideration of patients' perspective on their health.
Patient-reported outcome measures (PROMs) are questionnaires that measure patients' perception of their own health status, 13 including disease-related symptoms, side effects of treatments, quality of life, and impact on functioning. PROMs are increasingly incorporated in routine clinical care and can be used as predictors in ML methods foreseeing health outcomes, 14,15 for example, identifying patients at risk of experiencing undesirable clinical outcomes. 16 ML algorithms trained on patient-reported and clinical data accurately predicted financial toxicity in patients with early breast cancer. 17 Furthermore, PROMs enhanced ML performance predicting 5-year cancer survival, when added to clinical and sociodemographic variables. 18 Nevertheless, the benefits of inclusion of PROMs as predictors are inconsistent, as some studies did not find PROMs to have as meaningful an impact on model performance as objective measures. 19,20 The variability in effectiveness of PROMs in predicting patient outcomes may be caused by inconsistent performance metrics and conclusions drawn from data affected by inappropriate preprocessing methods, such as balancing data sets before creating training and testing sets, which often introduces bias. 21 The lack of methodologic agreement and guidance in the literature indicates the need for comparison of frequently used methods. The predictive value of PROMs is also not explored in detail because of the variety of PROMs currently used. 15 Therefore, this paper aims to address five research questions:
1. Do PROMs add predictive value to ML models?
2. Which PROMs are the most useful in predictions?
3. Which ML models have the best performance?
4. Did the preprocessing method for handling class imbalance affect model performance?
5. Which features were the most important for prediction?

Data Set
Data from 508 patients initiating systemic treatment for colorectal, breast, or gynecologic cancers at Leeds Cancer Centre (United Kingdom), collected in an eRAPID clinical trial between January 22, 2015, and June 11, 2018, 22 were used in this study. The data set contained 35 variables. Eight variables were clinical or demographic, collected from EHR. They included age at study entry, sex (male/female), number of days on study from the start of chemotherapy, study arm, disease site (breast/gynecologic/colorectal), previous chemotherapy (yes/no), information if the disease was metastatic or nonmetastatic, and the number of comorbidities (from the list: cardiovascular, respiratory, gastrointestinal, stomach/intestine, endocrine, renal, neurologic, rheumatologic, previous malignancy, and substance abuse). Twenty-four variables were from PROMs completed by participants on paper at the time of study entry. Fifteen of these PROM variables were from the 30-item European Organisation for Research and Treatment of Cancer Core Quality of Life Questionnaire (QLQ-C30), 23 containing information about participants' physical symptoms, perception of their physical function, emotional and social function, and overall health and quality of life. Another five PROM variables were from the EuroQol Five-Dimensional Visual Analogue Scale (EQ-5D), 24 including self-reported data on mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. The four remaining PROM variables were aggregated scores of physical, social, emotional, and functional well-being from the 28 items of the Functional Assessment of Cancer Therapy-General (FACT-G). 25
Three target variables were the number of hospital admissions, triage events (patients contacting emergency helpline or attending oncology admission unit), and changes to chemotherapy during the 18-week clinical trial. This information was extracted from EHR. The variables were selected because of their availability from the eRAPID clinical trial 22 and the consultation with clinicians regarding their relevance.

Variable Preparation
The overview of the methods is presented in Figure 1. Target features were transformed to binary variables with class 0 (no event) or 1 (at least one event) to enable binary classification. 26,27 To allow in-depth exploration of all PROM effects on the model performance in general, and when individual questionnaires are separately added to clinical data, six different feature sets were created with the following variables:
1. Only clinical
2. All available
3. Only PROMs
4. Clinical + QLQ-C30
5. Clinical + EQ-5D
6. Clinical + FACT-G
Continuous variables were scaled to unit variance to improve computational performance of ML. 28 To prevent algorithms from receiving repeated information, 29 correlated variables were removed from each feature set (leaving one), so that no Pearson coefficient higher than 0.6 was left. 30 The list of variables in each feature set is presented in Appendix Tables A1 and A2, including differences between classes.
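The variable preparation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the feature names and toy data are hypothetical, centering the columns in addition to scaling them is an assumption, and the greedy filter on the absolute Pearson coefficient is only one way to leave no pair above 0.6.

```python
import numpy as np

def binarize_counts(counts):
    """Map event counts to class 0 (no event) or 1 (at least one event)."""
    return (np.asarray(counts) > 0).astype(int)

def standardize(X):
    """Center each column and scale it to unit variance (centering is an assumption)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std

def drop_correlated(X, names, threshold=0.6):
    """Greedily keep a feature only if |r| with every already-kept feature <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Toy data with hypothetical feature names; "fatigue_copy" nearly duplicates "fatigue".
rng = np.random.default_rng(0)
age = rng.normal(60, 10, 200)
fatigue = rng.normal(50, 20, 200)
fatigue_copy = fatigue + rng.normal(0, 1, 200)

X = standardize(np.column_stack([age, fatigue, fatigue_copy]))
X_kept, kept_names = drop_correlated(X, ["age", "fatigue", "fatigue_copy"])
y = binarize_counts([0, 2, 1, 0])  # event counts become 0/1 labels
```

The greedy order decides which member of a correlated pair survives; the text only states that one variable of each correlated group was left, so any tie-breaking rule here is illustrative.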

Missing Data Imputation
All patients completed QLQ-C30, EQ-5D, and FACT-G at the clinical trial baseline. However, for 91 participants whose data were taken from the pilot study of the trial, only two subscales of QLQ-C30 were included, so patients from this phase did not have full QLQ-C30 data. The records from these participants were removed from affected feature sets (all variables, only PROMs, and clinical + QLQ-C30 variables). Using complete case analysis (CC) is justified under the missing completely at random assumption. Because the pilot trial ensured random selection of participants, the CC method is unlikely to bias results. 31 Any further cases of missing values were infrequent and likely resulted from participants omitting questions, which is a common issue in PROM data. 32 They were imputed using the K-nearest neighbors algorithm (k = 5), a common imputation method in relevant studies. 18,26,33,34

Handling Class Imbalance
To mitigate potential bias of class imbalance, 21 synthetic participants in the minority class can be created to match the number of participants in the majority class (oversampling). In previous studies, it was performed on the entire data set 18,35 or training set only. 36,37 ML can also be trained on original data and evaluated using multiple performance metrics. 33 Since there is no consistency in data preprocessing methods, the model performances in these scenarios were compared to discover bias in the results. Therefore, three data sets (original, balanced, and partially balanced, ie, with a balanced training set only) were created from each of the six feature sets for all target variables.
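The three class-imbalance scenarios compared in this study can be sketched as below. This is an illustration under assumptions rather than the study's pipeline: the oversampling algorithm is not named in the text, so simple random duplication of minority-class rows (via sklearn.utils.resample) stands in for the generation of synthetic participants, and the 80/20 split, the seed, and the toy data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def oversample_minority(X, y, seed=0):
    """Duplicate random minority-class rows until both classes have equal size.

    Stand-in for synthetic oversampling; the study's exact method is not stated.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = int(counts.max() - counts.min())
    if n_extra == 0:
        return X, y
    X_extra = resample(X[y == minority], replace=True,
                       n_samples=n_extra, random_state=seed)
    y_extra = np.full(n_extra, minority)
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])

def make_scenarios(X, y, test_size=0.2, seed=0):
    """Return (X_train, X_test, y_train, y_test) for each preprocessing scenario."""
    split = dict(test_size=test_size, random_state=seed)
    # 1. Original: split the imbalanced data as-is.
    original = train_test_split(X, y, stratify=y, **split)
    # 2. Balanced: oversample the entire data set, then split.
    X_bal, y_bal = oversample_minority(X, y, seed)
    balanced = train_test_split(X_bal, y_bal, stratify=y_bal, **split)
    # 3. Partially balanced: split first, then oversample the training set only.
    X_tr, X_te, y_tr, y_te = original
    X_tr_bal, y_tr_bal = oversample_minority(X_tr, y_tr, seed)
    partial = (X_tr_bal, X_te, y_tr_bal, y_te)
    return {"original": original, "balanced": balanced,
            "partially_balanced": partial}

# Toy imbalanced data: 40 events, 160 non-events.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.array([1] * 40 + [0] * 160)
scenarios = make_scenarios(X, y)
```

In the partially balanced scenario the test set keeps the original class ratio, whereas in the balanced scenario oversampled rows can appear in both the training and the test set, which is the source of the optimistic bias discussed later in the paper.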

ML Model Development
Six ML models, namely, logistic regression (LR), decision tree (DT), adaptive boosting (AB), random forest (RF), support vector machines (SVMs), and neural network (NN), were selected on the basis of their inclusion in previous research. 18,37 Hyperparameter tuning was performed on training sets through grid search with five-fold cross-validation. 37 The models were applied using the Python sklearn library.
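A grid search with five-fold cross-validation over the six model families might look like the sketch below. The hyperparameter grids, the roc_auc scoring choice, and the synthetic data are assumptions made for illustration; the grids actually searched in the study are not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, deliberately small grids (not the study's grids).
MODELS = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "DT": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, None]}),
    "AB": (AdaBoostClassifier(random_state=0), {"n_estimators": [25, 50]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
    "SVM": (SVC(probability=True, random_state=0), {"C": [0.1, 1.0]}),
    "NN": (MLPClassifier(max_iter=500, random_state=0),
           {"hidden_layer_sizes": [(8,), (16,)]}),
}

def tune_all(X_train, y_train, scoring="roc_auc"):
    """Grid-search every model with five-fold cross-validation on the training set."""
    best = {}
    for name, (estimator, grid) in MODELS.items():
        search = GridSearchCV(estimator, grid, cv=5, scoring=scoring)
        search.fit(X_train, y_train)
        best[name] = search.best_estimator_  # refit on the full training set
    return best

# Synthetic stand-in data; tuning only ever sees the training split.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
best_models = tune_all(X_train, y_train)
```

Keeping the test split out of the search means the cross-validated scores guide model selection while the held-out data remain untouched for final evaluation.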

Model Evaluation
Accuracy, precision, recall (also known as sensitivity), F1 score, and AUC were used to evaluate model performance. AUC, a commonly used metric in ML studies, was considered the main metric for model evaluation to enable between-studies comparisons. Model calibrations were evaluated with calibration plots of RF in balanced data sets and LR in the remaining data sets because of the best overall performance of these models in these scenarios. Feature importance analyses were also performed on these models. LR features were analyzed through the absolute values of regression coefficients. This method is only meaningful for standardized data with no multicollinearity, 38 which was accounted for by standardization of features and removing correlated variables. RF features were explored through the feature_importances_ attribute of the sklearn RandomForestClassifier. Analysis of variance with Tukey's honest significant difference tests was performed to compare model performances (Appendix Table A3).
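The evaluation metrics and the two feature importance approaches described above can be sketched as follows. The data, feature names, and model settings are hypothetical; only the metric functions and the coef_ and feature_importances_ attributes are standard scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def evaluate(model, X_test, y_test):
    """Compute accuracy, precision, recall, F1 score, and AUC for a fitted classifier."""
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "auc": roc_auc_score(y_test, y_score),
    }

def rank_features(model, names):
    """Rank features by |coefficient| (linear model) or impurity importance (forest)."""
    if hasattr(model, "coef_"):
        importance = np.abs(model.coef_).ravel()
    else:
        importance = model.feature_importances_
    order = np.argsort(importance)[::-1]
    return [(names[i], float(importance[i])) for i in order]

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Standardize so that LR coefficient magnitudes are comparable across features.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

lr_metrics = evaluate(lr, X_test, y_test)
rf_metrics = evaluate(rf, X_test, y_test)
lr_ranking = rank_features(lr, names)
rf_ranking = rank_features(rf, names)
```

Reporting several metrics side by side matters most on the imbalanced data, where accuracy and AUC can look acceptable while recall for the minority class stays low.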

RESULTS
Performance metrics and hyperparameters for all models applied to all feature sets for all preprocessing methods are presented in Appendix Table A4.

Overall Predictive Value of PROMs
For all models in original and balanced data sets, clinical variables had worse AUC than feature sets including PROMs. In the partially balanced data set, F1 score was higher for clinical variables in SVM (0.188) and NN (0.493) than for other feature sets. Nevertheless, these values were not the highest overall. For SVM, recall was also the highest value for the clinical variables (0.176). No AUC was the highest for clinical variables.

Predictive Value of Individual PROM Questionnaires
In the original data set, clinical + QLQ-C30 variables achieved the best AUC in all models except NN (AUC was highest for PROMs only). In the balanced data set, clinical + QLQ-C30 variables obtained the highest AUC in all models, apart from LR and SVM (AUCs were highest for all variables). In the partially balanced data set, the highest AUC was obtained by clinical + QLQ-C30 variables in LR, DT, RF (the same value as PROMs only), and NN. In AB, the highest AUC was achieved by all variables, and in SVM by clinical + FACT-G variables. Therefore, QLQ-C30 variables aided ML performance the most.

Model Performance
In the original data set, LR performed best in all feature sets, except for clinical + EQ-5D, where DT was superior (Fig 2A). The highest AUC (0.659) was obtained by LR with clinical + QLQ-C30 variables. In the balanced data set, RF was the best performing algorithm (highest AUC = 0.905) for all feature sets, apart from all variables and clinical + EQ-5D variables, where SVM performed slightly better. In the partially balanced data set, the best AUC (0.616) was achieved by LR using clinical + QLQ-C30 variables and AB using all variables. Balancing the entire data set improved model performance on the basis of all evaluation metrics (Figs 2B and 2E). Using partially balanced data resulted in similar AUCs and precision to original data, but improved F1 score and recall. Model calibration for predicting admissions was poor in original data and improved slightly in balanced and partially balanced scenarios (Fig 3). LR prioritized clinical variables, while RF focused on PROMs and age at study entry (Table 1).

Overall Predictive Value of PROMs
No AUC was highest for clinical variables in any of the models and data sets, suggesting that PROMs improved model performance. The only metrics for which clinical variables alone scored highest were F1 score and recall in original data (NN) and precision in balanced data (AB), but these were not the highest values considering all models.

Predictive Value of Individual PROM Questionnaires
In the original data set, feature sets achieving the highest AUCs were only PROM variables for LR, DT, and RF; all variables for SVM and NN; and clinical + QLQ-C30 variables for AB. In the balanced data set, all variables obtained the   1).

Overall Predictive Value of PROMs
In the original data set, AUC was the highest for clinical variables only in DT (0.623) and RF (0.623). However, these values were not the highest overall. In the balanced data set, recall was the only measure highest for clinical variables in AB (0.754) and NN (1). In the partially balanced data set, clinical variables obtained the highest precision (0.872) and AUC (0.682) in LR and SVM, respectively. Overall, models including PROMs had the highest AUC.

Predictive Value of Individual PROM Questionnaires
In the original data set, clinical + QLQ-C30 variables had the highest performance in LR and SVM; clinical + FACT-G variables for AB and NN; and clinical variables for DT and RF. In the balanced data set, clinical + QLQ-C30 variables obtained the highest AUC in DT and AB, clinical + EQ-5D variables for NN, and all variables for LR and RF. In SVM, all variables and clinical + QLQ-C30 variables achieved the same AUC (0.931). In the partially balanced data set, all variables obtained the highest AUC for DT, AB, and RF; only clinical variables for LR and SVM; and clinical + EQ-5D variables for NN.

Model Performance
Overall, models predicting changes to chemotherapy performed significantly better than models predicting triage (P < .01) and admissions (P < .001). No model in original and partially balanced data sets outperformed others. In the balanced data set, the best algorithms were RF and SVM. SVM with all and clinical + QLQ-C30 variables had the best overall performance (AUC = 0.931). The AUCs of the models increased when data were balanced, but there was no difference in F1 scores. There was no noticeable difference between original and partially balanced data sets (Figs 5D and 5F). Model calibration was very good in the partially balanced data set and slightly worse in other data types (Fig 3). LR prioritized clinical variables with individual FACT-G and EQ-5D features. RF mainly considered FACT-G and some clinical variables (Table 1).

DISCUSSION
We successfully applied a range of ML models to a complex oncology data set with clinical, PROM, and health outcome data. PROMs improved the overall performance of ML models for all target variables. Sometimes the best performing models included only PROM variables. Although there is evidence suggesting that using PROMs without objectively measured data in ML models can lead to accurate predictions, 15 this study encourages using both clinical and PROM data. The QLQ-C30 questionnaire added the most predictive value overall. This might be explained by QLQ-C30 being the only questionnaire with variables consistently significantly different between classes. These results are promising, as the wide availability of QLQ-C30 40 may aid its utilization in ML models for clinical practice.
LR, the simplest model, outperforming other methods in imbalanced data was also observed in previous studies. 26,35 Good performance of RF and SVM when predicting admissions and changes to chemotherapy in balanced data is consistent with the ensemble methods used in previously reported outcome predictions. 17,18,41 Changes to chemotherapy predictions had the best overall performance, which is further confirmed by great calibration of models predicting this target in the partially balanced data set. This might be explained by more frequent and stronger significance of feature differences between classes. Poor performance of triage predictions could be due to the more subjective nature of this target, compared with the clinical decision to admit a patient or make treatment changes.
Balancing data sets improved overall model performance. Using the balanced data set might decrease generalizability of models, as oversampling often causes overfitting. 42 Therefore, evaluating models on the balanced testing set prevents the models from being applied in clinical practice, as real-world data are never perfectly balanced. Nevertheless, training models on imbalanced data can lead to incorrect predictions, biased toward one of the classes, 43 which was apparent through low recall in admission predictions, admissions being the most imbalanced target. Using the partially balanced data set mitigates such bias and the lack of generalizability. This method ensures robustness of models through the balanced training set and obtains a more accurate perspective for real clinical data through the original testing set. 44
For all target variables, LR models focused more on clinical features than PROMs. RF models usually favored FACT-G variables with some relevant clinical or QLQ-C30 features (mainly sleep, cognitive, and social scales). Although these patterns were similar for all target variables, changes to chemotherapy predictions showed the smallest discrepancy between the feature ranks. It might be explained by the best predictive performance of this outcome. LR was often the best performing model in original data, which could explain its poor performance of predicting triage and admissions, as the clinical features for these targets did not have significant differences between classes (Appendix Table A3), yet the model was considering these variables the most important (Table 1). For changes to chemotherapy, there were many significantly different clinical features, explaining good performance of LR. RF usually favored PROMs, which explained this model struggling to predict outcomes from only clinical variables.
Inclusion of three different PROMs allowed understanding of their individual predictive value. This study addressed the inconsistency in preprocessing methods for class imbalance in existing studies 18,33,35-37 and highlighted differences in results generated from these three techniques. The variety of performance metrics reported allowed between-studies comparison 15 and in-depth understanding of models. Furthermore, consulting clinicians during study design ensured clinical relevance of research questions, which can support adoption of ML methods. 39
The limitations of this study include clinical trial data collection, which might not be representative of the population. 45 No information about patients' ethnicity was provided, which limited understanding of potential bias in data 46 and prevented subgroup analyses. 33 Small sample size is associated with higher accuracy in classification, 47 so using more data would prevent potential bias. Furthermore, this work used only PROMs collected at the beginning of chemotherapy (baseline), so potential over-time dependencies of patient reports were missed. Half of the participants used the clinical trial intervention, which might have affected the outcome, but this risk was mitigated through performance comparison in control and intervention groups, identifying no significant differences.
In conclusion, this study supported the evidence that PROMs, such as health-related quality of life, functioning, and symptom reporting, can improve the performance of ML models predicting patient outcomes. The predictive value of widely available PROMs, such as the QLQ-C30 questionnaire, supports the motivation for collecting and using these measures in ML research. The results inform further exploration of PROMs' effect as predictors, and potential application of ML models in clinical practice, if rigorous justification and reporting of methodology is performed. On the basis of large discrepancies across results from different preprocessing methods, this research alerts the scientific community to justify choices on the methods for balancing data. It is recommended to balance the training set only and to test models on original data to prevent bias. In future work, we plan to involve patients and clinicians to assess their attitudes to ML-based prediction and to explore the broader implications of the findings. We also plan to use PROM data collected longitudinally throughout chemotherapy treatment, as over-time changes in reporting might provide more meaningful conclusions.

FIG 3. Calibration plots of LR models in original data set (A), calibration of RF models in balanced data sets (B), and calibration of LR models in partially balanced data sets (C). LR, logistic regression; RF, random forest.
FIG 1. Flow diagram illustrating the methodology of the study. AB, adaptive boosting; DT, decision tree; EQ-5D, EuroQol Five-Dimensional Visual Analogue Scale; FACT-G, Functional Assessment of Cancer Therapy-General; KNN, k-nearest neighbors; LR, logistic regression; ML, machine learning; NN, neural network; PROMs, patient-reported outcome measures; QLQ-C30, European Organisation for Research and Treatment of Cancer Core Quality of Life Questionnaire; RF, random forest; SVMs, support vector machines.

TABLE 1. Feature Importance for LR and RF Models Predicting All Three Target Variables

highest AUC for DT, AB, RF, and SVM. Clinical + QLQ-C30 variables had the highest performance for LR, and clinical + EQ-5D variables for NN. In the partially balanced data set, clinical + EQ-5D variables had the best performance the most frequently (for AB, RF, SVM, and NN), and clinical + FACT-G variables were selected twice (for LR and DT).
(0.624) was obtained by LR in the clinical + FACT-G feature set. There was no outstanding model in the original data set, but in balanced and partially balanced data sets, SVM, RF, and NN provided the best F1 scores (Figs 4E and 4F). A slight increase in the AUCs for the balanced data set is noticeable in Figure 4B. AUCs and F1 scores in the partially balanced data set were similar to the original data set. For some models in balanced data, the F1 scores were lower than in other data sets. Model calibration remained poor across different data types (Fig 3). LR's main features comprised clinical data with individual PROM variables, while RF primarily considered FACT-G variables (Table

Abbreviations: EQ-5D, EuroQol Five-Dimensional Visual Analogue Scale; FACT-G, Functional Assessment of Cancer Therapy-General; LR, logistic regression; QLQ-C30, European Organisation for Research and Treatment of Cancer Core Quality of Life Questionnaire; RF, random forest.

TABLE A1. Variables in Each Feature Set

Abbreviations: EQ-5D, EuroQol Five-Dimensional Visual Analogue Scale; FACT-G, Functional Assessment of Cancer Therapy-General; PROMs, patient-reported outcome measures; QLQ-C30, European Organisation for Research and Treatment of Cancer Core Quality of Life Questionnaire.

TABLE A2. Differences in Features Between Classes in All Target Variables and the P Value Generated by Using Mann-Whitney U Test for All Features but AgeStudyEntry (t-test used), as It Was the Only Normally Distributed Variable


TABLE A3. Results of ANOVA Which Provided P < .05 for Outcomes, Preprocessing, and Model Comparisons

TABLE A4. Results of the Six Models for Each Target Variable, Each Feature Set, and Each Preprocessing Method Addressing Class Imbalance With Hyperparameters Selected Through Grid Search
